
[WIP] [DSV4] Quantization Support #41276

Draft
kylesayrs wants to merge 3 commits into vllm-project:main from neuralmagic:kylesayrs/deepseek-ct

Conversation

Contributor

@kylesayrs kylesayrs commented Apr 29, 2026

DeepSeek-V4-Flash-NVFP4-FP8

Model Optimizations

This model was produced using the following LLM Compressor branch: vllm-project/llm-compressor#2647

Deployment

vllm serve RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"
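Once the server is up, the model can be queried through vLLM's OpenAI-compatible API. The sketch below just builds a request payload for the standard /v1/chat/completions route; the prompt and max_tokens values are illustrative, and port 8089 matches the serve command above:

```python
import json

# Build a chat-completion request for the served model. This mirrors
# vLLM's OpenAI-compatible API; the prompt here is illustrative.
payload = {
    "model": "RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8",
    "messages": [{"role": "user", "content": "What is 12 * 7?"}],
    "max_tokens": 64,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8089/v1/chat/completions with the
# header "Content-Type: application/json" (e.g. via curl or urllib).
print(body)
```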

Evaluation

python tests/evals/gsm8k/gsm8k_eval.py
Results:
Accuracy: 0.910
Invalid responses: 0.000
Total latency: 173.006 s
Questions per second: 7.624
Total output tokens: 116217
Output tokens per second: 671.752
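As a quick sanity check (not part of the eval script), the reported figures are internally consistent: output tokens per second is total output tokens over total latency, and questions per second times latency recovers GSM8K's 1319-question test split:

```python
# Reported figures from the GSM8K evaluation above.
total_latency_s = 173.006
total_output_tokens = 116217
questions_per_second = 7.624

# Throughput: total output tokens divided by total latency.
tokens_per_second = total_output_tokens / total_latency_s
assert abs(tokens_per_second - 671.752) < 0.01  # matches the reported value

# Implied question count, consistent with GSM8K's 1319-question test set.
implied_questions = round(questions_per_second * total_latency_s)
print(implied_questions)  # → 1319
```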

For more details on how this model was created and run in LLM Compressor, please contact Kyle Sayers on the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack

@kylesayrs kylesayrs marked this pull request as draft April 29, 2026 19:14

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the deepseek Related to DeepSeek models label Apr 29, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the DeepseekV4 model implementation by adding a packed_modules_mapping for fused layers and implementing a safe initialization for scale_fmt that defaults to 'ue8m0' when the quantization configuration is missing or not a dictionary. I have no feedback to provide.
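The guarded scale_fmt initialization described in the review summary can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name is hypothetical, and the behavior when a valid config dict simply lacks the key is an assumption; only the 'ue8m0' fallback for a missing or non-dict config comes from the summary above.

```python
def resolve_scale_fmt(quantization_config) -> str:
    """Return the model's scale format, falling back to "ue8m0".

    Sketch of the safe initialization described in this PR: if the
    quantization config is missing or not a dict, default to "ue8m0".
    (Helper name is hypothetical; defaulting when the key is absent
    from a valid dict is an assumption.)
    """
    if not isinstance(quantization_config, dict):
        return "ue8m0"
    return quantization_config.get("scale_fmt", "ue8m0")

print(resolve_scale_fmt(None))  # → ue8m0
```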

Contributor

@dsikka dsikka left a comment


This pathway is likely to see multiple updates over the next few weeks. It would be good to add some form of smoke test.

@kylesayrs
Contributor Author

FYI, I'm seeing a slight accuracy loss with the model. I've ruled out output_dtype as the cause in #41533, which makes me suspect the quantization of the indexer/compressor wkv weights. I'm currently updating the checkpoint to skip this quantization and will post accuracy evaluations.

kylesayrs added 2 commits May 7, 2026 18:53
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs force-pushed the kylesayrs/deepseek-ct branch from f5fc438 to 322ca21 Compare May 7, 2026 22:53
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

Labels

deepseek Related to DeepSeek models
